neural network loss landscape
Large Scale Structure of Neural Network Loss Landscapes
Fort, Stanislav, Jastrzebski, Stanislaw
There are many surprising and perhaps counter-intuitive properties of the optimization of deep neural networks. We propose and experimentally verify a unified phenomenological model of the loss landscape that incorporates many of them. High dimensionality plays a key role in our model. Our core idea is to model the loss landscape as a set of high dimensional \emph{wedges} that together form a large-scale, inter-connected structure towards which optimization is drawn. We first show that hyperparameter choices such as learning rate, network width and $L_2$ regularization affect the path the optimizer takes through the landscape in similar ways, influencing the large-scale curvature of the regions the optimizer explores. We then predict and demonstrate new counter-intuitive properties of the loss landscape: we show the existence of low-loss subspaces connecting a set (not only a pair) of solutions, and verify this experimentally. Finally, we analyze recently popular ensembling techniques for deep networks in the light of our model.
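The claim about low-loss subspaces connecting a whole set of solutions can be probed numerically. Below is a minimal, framework-agnostic sketch, not code from the paper: the `loss_fn` helper (mapping a flattened parameter vector to a scalar loss) and the `solutions` array are assumptions. It samples convex combinations of several independently trained solutions and checks whether the loss stays low throughout the spanned region, rather than only along pairwise segments.

```python
import numpy as np

def probe_convex_hull(loss_fn, solutions, n_samples=200, seed=0):
    """Sample convex combinations of several trained solutions and
    evaluate the loss at each sampled point.

    loss_fn   : flat parameter vector (D,) -> scalar loss  (assumed helper)
    solutions : (m, D) array, one flattened trained model per row
    """
    rng = np.random.default_rng(seed)
    m = solutions.shape[0]
    losses = []
    for _ in range(n_samples):
        alpha = rng.dirichlet(np.ones(m))   # random point on the simplex
        theta = alpha @ solutions           # convex combination of solutions
        losses.append(loss_fn(theta))
    return np.asarray(losses)
```

If the low-loss-subspace claim holds for these m solutions, the maximum of the returned losses should stay comparable to the losses at the m endpoint solutions instead of spiking in the interior.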
Taxonomizing local versus global structure in neural network loss landscapes
Viewing neural network models in terms of their loss landscapes has a long history in the statistical mechanics approach to learning, and in recent years it has received attention within machine learning proper. Among other things, local metrics (such as the smoothness of the loss landscape) have been shown to correlate with global properties of the model (such as good generalization performance). Here, we perform a detailed empirical analysis of the loss landscape structure of thousands of neural network models, systematically varying learning tasks, model architectures, and/or quantity/quality of data. By considering a range of metrics that attempt to capture different aspects of the loss landscape, we demonstrate that the best test accuracy is obtained when: the loss landscape is globally well-connected; ensembles of trained models are more similar to each other; and models converge to locally smooth regions. We also show that globally poorly-connected landscapes can arise when models are small or when they are trained on lower-quality data; and that, if the loss landscape is globally poorly-connected, then training to zero loss can actually lead to worse test accuracy.
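The abstract does not spell out its metrics, but one standard proxy for global connectivity is the linear mode-connectivity barrier between two independently trained models. The sketch below is illustrative only; the `loss_fn` helper over flattened parameters is an assumption, not part of the paper.

```python
import numpy as np

def linear_barrier(loss_fn, theta_a, theta_b, n_points=25):
    """Loss barrier along the straight line between two trained models:
    max over t of L((1-t)*a + t*b) minus the linear interpolation of the
    endpoint losses. Barriers near zero suggest a well-connected
    landscape; large barriers suggest poor global connectivity.

    loss_fn : flat parameter vector -> scalar loss  (assumed helper)
    """
    ts = np.linspace(0.0, 1.0, n_points)
    path = np.array([loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts])
    baseline = (1.0 - ts) * path[0] + ts * path[-1]
    return float(np.max(path - baseline))
```

Averaging this barrier over many pairs of trained models gives a single number that can be tracked as model size or data quality is varied, in the spirit of the study described above.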
Reviews: Large Scale Structure of Neural Network Loss Landscapes
This paper takes a unique approach and aims high. In deep learning, there are many intriguing empirical observations already known; the road most travelled to understanding them is to prove these observations under certain assumptions, whereas the authors choose to link them through a descriptive model that otherwise could have nothing to do with neural networks. I appreciate this unique approach, which is actually a dominant approach in other sciences such as physics. The descriptive model in this paper, if more accurate than not, could potentially simplify the conceptual understanding of optimization for deep learning and motivate new algorithms. There are two reasons why I cannot more enthusiastically recommend this paper.
The authors propose a phenomenological model of the loss landscape of DNNs: they picture the landscape as a set of high dimensional wedges whose dimension is slightly lower than that of the full space, and describe how the optimizer traverses the loss landscape under common hyperparameter choices. Overall, this paper provides interesting insights into deep learning, although it is not yet clear how these insights could be used to improve the training of deep neural networks (with respect to both optimization and generalization). One problem with the paper is its presentation. Some of the reviewers were left confused after reading the paper. It would be critical for the authors to improve the writing (both text and figures) significantly in order to make the paper more accessible to its audience.
Emergent properties of the local geometry of neural loss landscapes
Fort, Stanislav, Ganguli, Surya
Abstract: The local geometry of high dimensional neural network loss landscapes can both challenge our cherished theoretical intuitions and dramatically impact the practical success of neural network training. Indeed, recent works have observed 4 striking local properties of neural loss landscapes on classification tasks: (1) the landscape exhibits exactly C directions of high positive curvature, where C is the number of classes; (2) gradient directions are largely confined to this extremely low dimensional subspace of positive Hessian curvature, leaving the vast majority of directions in weight space unexplored; (3) gradient descent transiently explores intermediate regions of higher positive curvature before eventually finding flatter minima; (4) training can be successful even when confined to low dimensional random affine hyperplanes, as long as these hyperplanes intersect a Goldilocks zone of higher than average curvature. We develop a simple theoretical model of gradients and Hessians, justified by numerical experiments on architectures and datasets used in practice, that simultaneously accounts for all 4 of these surprising and seemingly unrelated properties. Our unified model provides conceptual insights into the emergence of these properties and makes connections with diverse topics in neural networks, random matrix theory, and spin glasses, including the neural tangent kernel, BBP phase transitions, and Derrida's random energy model.

1 Introduction
The geometry of neural network loss landscapes, and the implications of this geometry for both optimization and generalization, have been subjects of intense interest in many works, ranging from studies on the lack of local minima at significantly higher loss than that of the global minimum [1, 2] to studies debating relations between the curvature of local minima and their generalization properties [3, 4, 5, 6]. Fundamentally, the neural network loss landscape is a scalar loss function over a very high D-dimensional parameter space that could depend a priori in highly nontrivial ways on the structure of real-world data itself, as well as on intricate properties of the neural network architecture. Moreover, the regions of this loss landscape explored by gradient descent could themselves have highly atypical geometric properties relative to randomly chosen points in the landscape.
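Properties (1) and (2) can be measured using Hessian-vector products alone, without ever materializing the full Hessian. Below is a minimal sketch under stated assumptions: the `hvp` helper (mapping a vector v to Hv, e.g. built from double backpropagation in one's framework of choice) and the flattened `grad` vector are assumptions for illustration, not the authors' code.

```python
import numpy as np

def top_hessian_eigs(hvp, dim, k=10, iters=50, seed=0):
    """Estimate the top-k Hessian eigenvalues/eigenvectors via block power
    iteration, using only Hessian-vector products.

    hvp : v (dim,) -> H @ v   (assumed helper, e.g. from double backprop)
    """
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((dim, k)))   # orthonormal start
    for _ in range(iters):
        HV = np.stack([hvp(V[:, j]) for j in range(k)], axis=1)
        V, _ = np.linalg.qr(HV)                          # re-orthonormalize
    HV = np.stack([hvp(V[:, j]) for j in range(k)], axis=1)
    eigs = np.diag(V.T @ HV)                             # Rayleigh quotients
    order = np.argsort(eigs)[::-1]
    return eigs[order], V[:, order]

def gradient_mass_in_subspace(grad, V):
    """Fraction of the squared gradient norm lying in span(V). Property (2)
    predicts values near 1 when V spans the top-C curvature directions."""
    proj = V @ (V.T @ grad)
    return float(proj @ proj / (grad @ grad))
```

With k set somewhat above the number of classes C, the estimated spectrum should exhibit roughly C outlier eigenvalues (property 1), and `gradient_mass_in_subspace` quantifies how much of the gradient is confined to that outlier subspace (property 2).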